Bayes-optimal performance in a discrete space 
Abstract



We study a simple model of unsupervised learning where the single symmetry breaking vector 
has binary components ±1. We calculate exactly the Bayes-optimal performance of an estimator 
which is required to lie in the same discrete space. We also show that, except for very special 
cases, such an estimator cannot be obtained by minimization of a class of variationally optimal 
potentials. 



Statistical mechanics techniques have been used with success to study and understand key properties
of inferential learning [1, 2]. This approach provides explicit and detailed results that are in
many ways complementary to the more general results obtained by statistics. The case of non-smooth
problems, in which the parameters that have to be estimated take discrete values, is of particular
interest. On the one hand, many of the results from statistics can no longer be applied, while on
the other hand, the estimation of these parameters is often a computationally hard problem. In this
paper, we present a detailed analysis of a simple model of unsupervised learning
involving a single symmetry breaking vector with binary components ±1 and highlight the differences
with the case of smooth components. In particular we compare the results from Gibbs learning and
Bayes learning with the ones for the best binary vector and a vector which minimizes a variationally
optimal potential.

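The problem is as follows: $p$ $N$-dimensional real patterns $\{\boldsymbol{\xi}^\mu,\ \mu = 1, \ldots, p\}$ are sampled
independently from a distribution $P(\boldsymbol{\xi}^\mu|\mathbf{B}) \sim \delta(\boldsymbol{\xi}^\mu \cdot \boldsymbol{\xi}^\mu - N)\, \exp[-U(\mathbf{B} \cdot \boldsymbol{\xi}^\mu/\sqrt{N})]$
with a single symmetry breaking direction $\mathbf{B}$. The function $U$ modulates the distribution of the
patterns along $\mathbf{B}$. We will focus on the properties in the thermodynamic limit $N \to \infty$, $p \to \infty$
with $\alpha = p/N$ finite. One then finds that the normalized projection $t = \mathbf{B} \cdot \boldsymbol{\xi}/\sqrt{N}$ is distributed
according to ($\mathcal{N}$ being a normalization constant)

$$ P^*(t) = \frac{\mathcal{N}}{\sqrt{2\pi}} \exp\left\{ -\frac{t^2}{2} - U(t) \right\} , \qquad (1) $$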

while projections on any direction orthogonal to $\mathbf{B}$ are normal. The case of a so-called spherical
prior, in which $\mathbf{B}$ is chosen at random on the sphere with radius $\sqrt{N}$, was discussed previously. As
announced earlier, we focus here on the more complicated situation in which the components of $\mathbf{B}$
take binary values ±1. The prior distribution is now given by:

$$ P(\mathbf{B}) = P_b(\mathbf{B}) = \prod_{j=1}^{N} \left[ \frac{1}{2}\,\delta(B_j - 1) + \frac{1}{2}\,\delta(B_j + 1) \right] . \qquad (2) $$



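The goal of unsupervised learning is to give an estimate $\mathbf{J}$ of $\mathbf{B}$. One way to do so is to sample
$\mathbf{J}$ from a Boltzmann distribution with Hamiltonian $\mathcal{H}(\mathbf{J}) = \sum_{\mu=1}^{p} V(\lambda^\mu)$, with $\lambda^\mu = \mathbf{J} \cdot \boldsymbol{\xi}^\mu/\sqrt{N}$, at
temperature $T = \beta^{-1}$, for an appropriate choice of the ad hoc potential $V$. The properties of such
a $\mathbf{J}$-vector can be extracted from the partition function:

$$ Z = \int d\mathbf{J}\, P_b(\mathbf{J})\, e^{-\beta \mathcal{H}(\mathbf{J})} . \qquad (3) $$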

The latter is a fluctuating quantity due to the random choice of the patterns, but the free energy per
component $f = -(N\beta)^{-1} \ln Z$ is expected to be self-averaging in the thermodynamic limit and can
therefore be calculated by averaging over the pattern distribution with the aid of the replica trick.
Assuming replica symmetry (RS), one finds

$$ \begin{aligned} f = \frac{1}{\beta}\, \mathop{\rm Extr}_{R,q,\hat R,\hat q} \Bigg\{ & \frac{1}{2}(1-q)\,\hat q + R\hat R - \int Dz\, \ln\cosh\!\big(z\sqrt{\hat q} + \hat R\big) \\ & - \alpha \int D^*t \int Dt'\, \ln \int \frac{d\lambda}{\sqrt{2\pi(1-q)}} \exp\!\left( -\beta V(\lambda) - \frac{\big(\lambda - Rt - \sqrt{q-R^2}\,t'\big)^2}{2(1-q)} \right) \Bigg\} \end{aligned} \qquad (4) $$

where $D^*t = dt\,P^*(t)$ and $Dt' = dt'\,(2\pi)^{-1/2} \exp(-t'^2/2)$. The extremum operator gives saddle
point equations which determine the self-averaging value of the order parameters. As usual $q$ can
be interpreted as the typical mutual overlap between two samples $\mathbf{J}$ and $\mathbf{J}'$, $q = \mathbf{J} \cdot \mathbf{J}'/N$, while
the performance $R$ measures the proximity between the estimate $\mathbf{J}$ and the "true" direction $\mathbf{B}$,
$R = \mathbf{J} \cdot \mathbf{B}/N$. For even functions $U$, there is no distinction between $\mathbf{B}$ and $-\mathbf{B}$, and a symmetry
$R \to -R$ arises. In the following, only $R \ge 0$ will be considered.

As a first application of eq. (4), we turn to Gibbs learning. It corresponds to sampling
from the posterior distribution and is realized by taking $\beta = 1$ and $V = U$ in eq. (3)
(for the estimation of $U$, see [13]). In agreement with the fact that one cannot make a statistical
distinction between $\mathbf{B}$ and its Gibbsian estimate $\mathbf{J}$, one finds that the order parameters satisfy
$q_G = R_G$ and $\hat q_G = \hat R_G$, where the subscript $G$ refers to Gibbs learning. This observation allows one to
simplify the saddle point equations further, and the Gibbs overlap is found to obey the following equation:

$$ R_G = F_B^2\big(\mathcal{F}(\sqrt{R_G})\big) \qquad (5) $$

with

$$ F_B(x) = \sqrt{\int Dz\, \tanh(zx + x^2)} \quad \text{and} \quad \mathcal{F}(R) = \sqrt{\alpha \int Dt\, \frac{Y^2(t;R)}{X(t;R)}} \qquad (6) $$

and

$$ X(t;R) = \int Dt'\, e^{-U(Rt + \sqrt{1-R^2}\,t')} , \qquad Y(t;R) = \frac{1}{R} \frac{\partial}{\partial t} X(t;R) . \qquad (7) $$

Note that $F_B$ comes from the entropic term of the free energy and does not depend on $U$, as opposed
to $\mathcal{F}$, $X$ and $Y$.

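For $R_G$ small, one obtains from eqs. (5)-(7), upon assuming a smooth behavior as a function of $\alpha$
(with the notation $\int D^*t\, f(t) \equiv \langle f(t) \rangle_*$):

$$ \langle t \rangle_* \neq 0 \ \Rightarrow\ R_G \approx \alpha \langle t \rangle_*^2 \qquad (8) $$

$$ \langle t \rangle_* = 0 \ \Rightarrow\ R_G \begin{cases} = 0 , & \alpha \le \alpha_G \\ \approx C(\alpha - \alpha_G) , & \alpha \ge \alpha_G \end{cases} \qquad (9) $$

with critical load $\alpha_G = (1 - \langle t^2 \rangle_*)^{-2}$. These results are identical to those for a spherical prior. In
particular, one observes the appearance of retarded learning when the distribution has a zero mean
along the symmetry breaking axis. In the regime $R_G \to 1$, on the other hand, one finds an exponential
approach:

$$ 1 - R_G(\alpha) \approx \sqrt{\frac{\pi}{2\alpha \langle (U')^2 \rangle_*}}\ \exp\left\{ -\frac{\alpha \langle (U')^2 \rangle_*}{2} \right\} \qquad (10) $$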



where $U' = dU(t)/dt$. This is now different from the case of a spherical prior, where the approach
follows an inverse power law $1 - R_G \sim \alpha^{-1}$ [8]. The difference becomes even more pronounced
when $U$ has singular derivatives, as is typically the case when a supervised problem is mapped onto
an unsupervised version. Then one finds that $R_G = 1$ is attained at a finite value of $\alpha$, while
$1 - R_G \sim \alpha^{-2}$ for a spherical prior; an explicit example is given in the literature.
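As an illustration of how the fixed-point equation (5) behaves, the following sketch iterates it numerically for one concrete choice of potential. The choice $U(t) = b^2/2 - \ln\cosh(bt)$ (two Gaussian clusters at $\pm b$), the parameter name `b`, the grid-based Gaussian averages and all function names are assumptions made for this example, not part of the original analysis. For this $U$ one has $\langle t \rangle_* = 0$ and $\langle t^2 \rangle_* = 1 + b^2$, so the critical load quoted above would be $\alpha_G = b^{-4}$, and a direct calculation gives $X = e^{-b^2R^2/2}\cosh(bRt)$ and $Y = b\,e^{-b^2R^2/2}\sinh(bRt)$.

```python
import numpy as np

# Uniform grid used for Gaussian averages <f(z)>, z ~ N(0, 1).
Z = np.linspace(-12.0, 12.0, 9601)
W = np.exp(-0.5 * Z**2) / np.sqrt(2.0 * np.pi) * (Z[1] - Z[0])  # weights


def gauss_avg(values):
    """Gaussian average of a function tabulated on the grid Z."""
    return float(np.sum(W * values))


def FB_squared(x):
    """F_B(x)^2 = int Dz tanh(z x + x^2), cf. eq. (6)."""
    return max(gauss_avg(np.tanh(Z * x + x * x)), 0.0)


def F_squared(R, alpha, b):
    """F(R)^2 = alpha int Dt Y^2/X for the assumed U(t) = b^2/2 - ln cosh(b t)."""
    a = b * R * Z
    ratio = np.sinh(a) * np.tanh(a)  # equals sinh^2(a) / cosh(a)
    return alpha * b**2 * np.exp(-0.5 * (b * R) ** 2) * gauss_avg(ratio)


def gibbs_overlap(alpha, b, R0=0.5, n_iter=500):
    """Iterate R_G <- F_B^2(F(sqrt(R_G))) starting from R0."""
    RG = R0
    for _ in range(n_iter):
        x = np.sqrt(F_squared(np.sqrt(RG), alpha, b))
        RG = FB_squared(x)
    return RG


if __name__ == "__main__":
    b = 1.0
    for alpha in (0.5, 1.5, 3.0, 6.0):
        print(alpha, gibbs_overlap(alpha, b))
```

With `b = 1` the iteration should collapse to $R_G = 0$ below $\alpha_G = 1$ and settle on a positive overlap above it, which is the retarded-learning behaviour of eq. (9). Plain iteration is used only for simplicity; close to the threshold it converges slowly, and a root finder would be the more robust choice.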

Apart from its intrinsic interest, Gibbs learning is also directly related to the Bayes optimal
overlap by $R_B = \sqrt{R_G}$. This overlap is realized by the center of mass $\mathbf{J}_B$ of the
Gibbs ensemble. A simple reasoning shows that $\mathbf{J}_B$ maximizes the overlap $R$ averaged over the
posterior distribution of $\mathbf{B}$. In order to exclude the case $\mathbf{J}_B = 0$ (which would follow in the presence of
the symmetry $\mathbf{B} \to -\mathbf{B}$), we will implicitly assume an infinitesimally small symmetry breaking field
in the Gibbs distribution.

Using the self-averaging of the mutual overlap, with $q_G = R_G$, the explicit form of $\mathbf{J}_B$ is found to
be $\mathbf{J}_B = R_G^{-1/2} \langle \mathbf{J} \rangle_G$, with $\langle \mathbf{J} \rangle_G = Z^{-1} \int d\mathbf{J}\, P_b(\mathbf{J})\, \mathbf{J}\, \exp\{-\sum_\mu U(\lambda^\mu)\}$. In general, the components
of this center of mass are continuous, while our prime interest here is in the optimal performance
attainable by a binary vector. The latter vector, which we will denote by $\mathbf{J}_{bb}$ (for best binary), can
fortunately be easily obtained: it is the clipped version of the center of mass $\mathbf{J}_B$, with components
$(\mathbf{J}_{bb})_j = \mathrm{sign}((\mathbf{J}_B)_j)$. To evaluate the overlap between $\mathbf{J}_{bb}$ and $\mathbf{B}$, we recall the following general
result for the overlap $\tilde R = \tilde{\mathbf{J}} \cdot \mathbf{B}/N$ of a vector $\tilde{\mathbf{J}}$ with transformed components
$\tilde J_i = \sqrt{N}\, g(J_i)/\sqrt{\sum_k g^2(J_k)}$ (with $g$ odd and $\mathbf{B}$ binary) as a function of the overlap $R$ of $\mathbf{J}$ with $\mathbf{B}$:

$$ \tilde R = \frac{\int P(x)\, g(x)\, dx}{\left[ \int P(x)\, g^2(x)\, dx \right]^{1/2}} \qquad (11) $$

where $P(x)$ is the probability density for $x \equiv J_1 B_1$, which for the prior distribution eq. (2) is independent
of the index due to the permutation symmetry among the axes. If $\mathbf{J}$ is sampled from a spherical
distribution (with $\mathbf{B}$ binary), then $P(x)$ is found to be a Gaussian with mean $R$ and variance
$1 - R^2$.
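The Gaussian case just mentioned can be checked directly. For $g = \mathrm{sign}$ the denominator of eq. (11) equals one, and a Gaussian $P(x)$ with mean $R$ and variance $1 - R^2$ gives $\tilde R = \mathrm{erf}\big(R/\sqrt{2(1-R^2)}\big)$. The sketch below compares this with a single sampled vector; the construction of $\mathbf{J}$ as $R\,\mathbf{B}$ plus Gaussian noise (which satisfies the spherical constraint only up to $O(N^{-1/2})$ corrections) and the function names are assumptions made for the example.

```python
import math
import numpy as np


def clipped_overlap_theory(R):
    """Eq. (11) with g = sign and Gaussian P(x): mean R, variance 1 - R^2."""
    return math.erf(R / math.sqrt(2.0 * (1.0 - R * R)))


def clipped_overlap_sample(R, N=200_000, seed=0):
    """Clip one random vector J having overlap ~R with a random binary B."""
    rng = np.random.default_rng(seed)
    B = rng.choice([-1.0, 1.0], size=N)
    noise = rng.standard_normal(N)
    # J.B/N ~ R and |J|^2/N ~ 1, up to O(N^{-1/2}) fluctuations.
    J = R * B + math.sqrt(1.0 - R * R) * noise
    return float(np.mean(np.sign(J) * B))


if __name__ == "__main__":
    for R in (0.1, 0.5, 0.9):
        print(R, clipped_overlap_theory(R), clipped_overlap_sample(R))
```

For small $R$ the theoretical curve reduces to $\tilde R \approx \sqrt{2/\pi}\,R$, the familiar loss incurred by clipping an isotropic vector.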

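In order to obtain $P(x)$ corresponding to the center of mass $\mathbf{J}_B$, we evaluate the quenched moments
of $y \equiv x\sqrt{R_G}$:

$$ \langle y^n \rangle = \left\langle \left( Z^{-1} \int d\mathbf{J}\, P_b(\mathbf{J})\, e^{-\sum_\mu U(\lambda^\mu)}\, J_1 B_1 \right)^{n} \right\rangle . \qquad (12) $$

The average $\langle \ldots \rangle$ over the quenched pattern set can be performed by the replica trick with the following
replica symmetric result:

$$ \langle y^n \rangle = \int Dz\, \tanh^n\!\big( z\sqrt{\hat R_G} + \hat R_G \big) , \qquad (13) $$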



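where $\hat R_G$, which is determined by the saddle point equations of Gibbs learning, cf. eq. (5), is found
to be $\hat R_G = \mathcal{F}^2(\sqrt{R_G})$. Recognizing eq. (13) as a transformation of variables $y = \tanh(z\sqrt{\hat R_G} + \hat R_G)$
with $z$ normally distributed, one concludes [18]:

$$ P(x) = \frac{\sqrt{R_G}}{\sqrt{2\pi \hat R_G}\,(1 - R_G x^2)}\ \exp\left\{ -\frac{1}{2\hat R_G} \left[ \frac{1}{2} \ln\left( \frac{1 + \sqrt{R_G}\,x}{1 - \sqrt{R_G}\,x} \right) - \hat R_G \right]^2 \right\} . \qquad (14) $$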



By applying eq. (11) for $g(x) = \mathrm{sign}(x)$, with $P(x)$ given by eq. (14), one finally obtains the following
overlap $R_{bb} = \mathbf{J}_{bb} \cdot \mathbf{B}/N$ of the best binary vector:

$$ R_{bb} = 1 - 2H\big(F_B^{-1}(R_B)\big) = 1 - 2H\big(\mathcal{F}(R_B)\big) , \qquad (15) $$

where $H(x) = \int_x^\infty Dt$. Eq. (15) is a central result of this paper, providing an upper bound for the
performance of any binary vector. The asymptotics of $R_{bb}$ can be obtained from those of $R_G$,
yielding

$$ R_{bb} \approx \sqrt{\frac{2 R_G}{\pi}} \qquad (R_G \to 0) \qquad (16) $$

in the poor performance regime, and an exponential behavior in the limit $R_G \to 1$:

$$ 1 - R_{bb} \approx \frac{2}{\pi}\,(1 - R_G) . \qquad (17) $$



We note that another quantity of interest, the mutual overlap $\Gamma = \mathbf{J}_B \cdot \mathbf{J}_{bb}/N$ between center of mass
and best binary, can also be evaluated quite easily, leading to the simple result $\Gamma = R_{bb}/R_B$. In the
limit $R_G \to 0$ one recovers $\Gamma \to \sqrt{2/\pi}$, which is the result for the overlap between a vector sampled
at random from the $N$-sphere and its clipped counterpart. $\Gamma$, $R_B$ and $R_{bb}$ are plotted as functions of
$R_G$ in fig. 1.

Figure 1: $\Gamma$, $R_B$ and $R_{bb}$ parametrized by $R_G$, cf. eq. (15).
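The curves of Figure 1 do not depend on $U$ and can be regenerated from $F_B$ alone: for a given $R_G$ one solves $F_B^2(x) = R_G$ for $x$, and then $R_B = \sqrt{R_G}$, $R_{bb} = 1 - 2H(x)$ and $\Gamma = R_{bb}/R_B$. The sketch below does this by bisection; the grid-based Gaussian average, the bisection bounds and the function names are assumptions of this example, and it presumes that $F_B^2$ is increasing in $x$.

```python
import math
import numpy as np

# Uniform grid for Gaussian averages over z ~ N(0, 1).
Z = np.linspace(-12.0, 12.0, 24001)
W = np.exp(-0.5 * Z**2) / math.sqrt(2.0 * math.pi) * (Z[1] - Z[0])


def FB_squared(x):
    """F_B(x)^2 = int Dz tanh(z x + x^2)."""
    return float(np.sum(W * np.tanh(Z * x + x * x)))


def H(x):
    """Gaussian tail H(x) = int_x^infinity Dt."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def invert_FB_squared(RG, lo=0.0, hi=10.0, tol=1e-12):
    """Solve F_B(x)^2 = RG for x by bisection (assumes monotonic F_B^2)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if FB_squared(mid) < RG:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def figure1_point(RG):
    """Return (R_B, R_bb, Gamma) for a given Gibbs overlap R_G in (0, 1)."""
    x = invert_FB_squared(RG)
    RB = math.sqrt(RG)
    Rbb = 1.0 - 2.0 * H(x)
    return RB, Rbb, Rbb / RB


if __name__ == "__main__":
    for RG in (1e-4, 0.1, 0.5, 0.9, 0.999):
        print(RG, figure1_point(RG))
```

The two limits quoted in the text serve as sanity checks: $\Gamma$ should approach $\sqrt{2/\pi}$ as $R_G \to 0$, and $(1 - R_{bb})/(1 - R_G)$ should approach $2/\pi$ as $R_G \to 1$.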



We finally turn to the problem of a variationally optimized potential. In the case of a spherical
prior, it was shown that the Bayes-optimal performance can indeed be attained by a vector that
minimizes this potential [15, 8, 16, 17]. We now address the question of whether the same procedure
is successful in discrete space, a problem which has also been studied in [11] for the supervised scenario.
Since $\mathbf{J}_{bb}$ is a unique optimal binary vector, one would like the desired potential to satisfy both $R = R_{bb}$
and $q = 1$. Proceeding again from the free energy eq. (4) for a general potential $V$, taking the limits
$q \to 1$, $\beta \to \infty$ with finite $c = \beta(1 - q)$, and rescaling the conjugate parameters $\hat c = \hat q/\beta^2$, $\hat y = \hat R/\beta$,
one obtains the following saddle point equations:

$$ R = 1 - 2H\!\left( \frac{\hat y}{\sqrt{\hat c}} \right) , \qquad c = \sqrt{\frac{2}{\pi \hat c}}\ \exp\left( -\frac{\hat y^2}{2\hat c} \right) , $$

$$ \hat c = \frac{\alpha}{c^2} \int Dt\, X(t;R)\, [\lambda(t,c) - t]^2 , \qquad \hat y = \frac{\alpha}{c} \int Dt\, Y(t;R)\, [\lambda(t,c) - t] , \qquad (18) $$

where $\lambda(t,c) = \mathrm{argmin}_\lambda\, [V(\lambda) + (\lambda - t)^2/2c]$. The variational optimization of $R$ with respect to the
choice of $V$ can now be performed as in refs. [15, 8, 16, 17] invoking the Schwarz inequality. We only
quote the final result for the resulting overlap $R_{opt}$ at the minimum of this optimal potential:

$$ R_{opt} = 1 - 2H\big(\mathcal{F}(R_{opt})\big) . \qquad (19) $$
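The Schwarz-inequality step behind eq. (19) can be sketched as follows. This is a reconstruction under the assumption that the saddle point equations have the form $\hat c = (\alpha/c^2)\int Dt\, X\,[\lambda - t]^2$ and $\hat y = (\alpha/c)\int Dt\, Y\,[\lambda - t]$, with $\mathcal{F}$ as defined in eq. (6); it is not quoted from the cited references.

```latex
% Split the integrand of \hat y as (Y/\sqrt{X}) \cdot \sqrt{X}\,[\lambda - t]
% and apply the Schwarz inequality:
\hat y^{\,2}
  = \frac{\alpha^2}{c^2}
    \left( \int Dt\, \frac{Y}{\sqrt{X}}\, \sqrt{X}\,[\lambda - t] \right)^{2}
  \le \left( \alpha \int Dt\, \frac{Y^2}{X} \right)
      \left( \frac{\alpha}{c^2} \int Dt\, X\,[\lambda - t]^2 \right)
  = \mathcal{F}^2(R)\, \hat c .
% Hence \hat y / \sqrt{\hat c} \le \mathcal{F}(R). Since 1 - 2H(x) is
% increasing in x, the first saddle point equation gives
R = 1 - 2H\!\left( \frac{\hat y}{\sqrt{\hat c}} \right)
  \le 1 - 2H\big( \mathcal{F}(R) \big) ,
% with equality when \lambda(t,c) - t \propto Y(t;R)/X(t;R).
```

Saturating the bound then reproduces the self-consistent equation (19) for $R_{opt}$.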

The important issue to be examined is whether or not $R_{opt}(\alpha)$ saturates the bound given by the best
binary. By comparison of eq. (19) with eq. (15), one immediately concludes that this is not possible, as
long as $\mathcal{F}$ is neither constant nor singular, since $R_{opt} = R_{bb}$ would imply that $\mathcal{F}(R_{bb}) = \mathcal{F}(R_B)$, and
$R_{bb} = R_B$ is excluded by the first equality in eq. (15). In general one thus has $R_{opt} < R_{bb}$, since
$d\mathcal{F}/dR > 0$. The equality is reached in asymptotic limits and for a special case (see below). For
$R_{opt} \approx 0$ one has:

$$ \langle t \rangle_* \neq 0 \ \Rightarrow\ R_{opt} \approx |\langle t \rangle_*| \sqrt{\frac{2\alpha}{\pi}} \qquad (20) $$

$$ \langle t \rangle_* = 0 \ \Rightarrow\ R_{opt} \begin{cases} = 0 , & \alpha \le \alpha_c \\ \approx \sqrt{C'(\alpha - \alpha_c)} , & \alpha \ge \alpha_c \end{cases} \qquad (21) $$

where the critical value now is $\alpha_c = \pi \alpha_G/2$. Furthermore, the approach $R_{opt} \to 1$ is identical to that
of $R_{bb}$, $1 - R_{opt} \approx 1 - R_{bb}$. Therefore $V_{opt}$ is successful only in the asymptotic limits $\alpha \to 0$ and
$\alpha \to \infty$. Note that the second order phase transition in eq. (21) occurs at a larger value of $\alpha$ than for
Gibbs learning.

The case $\mathcal{F}(R)$ independent of $R$, implying $R_{opt} = R_{bb}$ for all $\alpha$, arises in a simple Gaussian scenario
with a linear function $U$. In this case the best binary corresponds to clipped Hebbian learning.
This seems to be the only case in which minimization of an optimal potential reproduces the best
binary vector. We conclude that an optimal potential saturating the $R_{bb}$ bound with $q \to 1$ cannot be
constructed, in general. This motivates the search for alternative methods in discrete optimization. The
main issue is to find new ways to incorporate information about the binary nature of the symmetry
breaking vector, other than simply imposing the same binary constraint in the solution space. An
interesting approach would be to try to construct a suitable potential for the continuous center of
mass $\mathbf{J}_B$ from which the best binary could be obtained by clipping. Whether such an approach is
possible will be answered in future work.
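The linear case discussed above lends itself to a quick simulation. Assuming $U(t) = -bt$ up to a constant, the projection $t$ is Gaussian with mean $b$ and unit variance, the definitions of eqs. (6)-(7) give $\mathcal{F} = b\sqrt{\alpha}$ independent of $R$, and eq. (15) reduces to $R_{bb} = 1 - 2H(b\sqrt{\alpha}) = \mathrm{erf}(b\sqrt{\alpha/2})$. The sketch below compares this value with the overlap of a clipped Hebbian vector. The slope `b`, the system size, the function names and the use of unconstrained Gaussian patterns (the spherical constraint $|\boldsymbol{\xi}|^2 = N$ is only satisfied approximately) are assumptions of this example.

```python
import math
import numpy as np


def clipped_hebb_overlap(alpha, b, N=2000, seed=0):
    """Overlap with B of the clipped Hebbian vector for one pattern set."""
    rng = np.random.default_rng(seed)
    p = int(alpha * N)
    B = rng.choice([-1.0, 1.0], size=N)
    # Gaussian patterns shifted by b/sqrt(N) along B, so that the projection
    # t = B.xi/sqrt(N) has mean b and unit variance.
    xi = rng.standard_normal((p, N)) + (b / math.sqrt(N)) * B
    J_hebb = xi.sum(axis=0)      # Hebbian vector: sum over all patterns
    J_clipped = np.sign(J_hebb)  # binary estimate obtained by clipping
    return float(np.mean(J_clipped * B))


def best_binary_theory(alpha, b):
    """R_bb = 1 - 2 H(b sqrt(alpha)) = erf(b sqrt(alpha / 2))."""
    return math.erf(b * math.sqrt(alpha / 2.0))


if __name__ == "__main__":
    for alpha in (0.25, 1.0, 4.0):
        print(alpha, clipped_hebb_overlap(alpha, 1.0), best_binary_theory(alpha, 1.0))
```

At finite $N$ the simulated overlap fluctuates around the theoretical value with a spread of order $N^{-1/2}$, so agreement should be judged within that tolerance.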